The complete dataset includes 17 independent variables and 1 dependent variable. Based on their nature, the independent variables were classified into three groups: Freedom Scores, Work Variation, and Ethnic Variation.
## Individual EDA of Freedom Scores
## Observations per group: 17391, 17605, 21042, 12633. 1001 missing.
## Factor w/ 4 levels "[-0.827,-0.197]",..: 3 3 3 3 3 3 3 3 3 3 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.8272 -0.1968 0.0176 -0.0811 0.1376 0.3550 1001


## Observations per group: 17296, 17735, 19513, 14127. 1001 missing.
## Factor w/ 4 levels "[-0.0444,0.0135]",..: 1 1 1 1 1 1 1 1 1 1 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.0444 0.0135 0.0803 0.0642 0.1064 0.2450 1001


## Observations per group: 17603, 16971, 17167, 16930. 1001 missing.
## Factor w/ 4 levels "[-0.457,-0.223]",..: 3 3 3 3 3 3 3 3 3 3 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.4569 -0.2228 -0.0737 -0.1470 -0.0322 0.0715 1001


## Observations per group: 17952, 18087, 15977, 16655. 1001 missing.
## Factor w/ 4 levels "[-0.37,-0.0202]",..: 3 3 3 3 3 3 3 3 3 3 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.3702 -0.0202 0.0634 0.0660 0.1767 0.4024 1001


## Observations per group: 17412, 18197, 16341, 16721. 1001 missing.
## Factor w/ 4 levels "[-0.814,-0.0878]",..: 2 2 2 2 2 2 2 2 2 2 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.8136 -0.0878 0.0794 -0.0169 0.1637 0.4614 1001
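
The quartile groupings and summaries printed above can be reproduced along the following lines. This is a minimal sketch in base R on synthetic data: the vector `score` is a hypothetical stand-in for one freedom score, and the original report may have used a different helper to produce the "Observations per group" line.

```r
# Synthetic stand-in for one freedom score; the real column is not shown here.
set.seed(1)
score <- c(runif(200, min = -0.8, max = 0.4), rep(NA, 5))

# Quartile break points, ignoring missing values.
breaks <- quantile(score, probs = seq(0, 1, by = 0.25), na.rm = TRUE)

# Bin into four groups; include.lowest = TRUE keeps the minimum in the first
# bin, which gives the closed "[a,b]" label seen on the first factor level.
score_q <- cut(score, breaks = breaks, include.lowest = TRUE)

table(score_q, useNA = "ifany")  # observations per group, plus missing
str(score_q)                     # Factor w/ 4 levels ...
summary(score)                   # Min., quartiles, Mean, Max., NA's
```

Missing values stay `NA` after `cut()`, so they are counted separately rather than assigned to a quartile.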


Next, the seventeen independent variables were analyzed, beginning with the five freedom scores: economic freedom, personal freedom, regulatory policy, fiscal policy, and overall freedom. The boxplots were split into four quartiles of roughly equal size according to income per capita. For all five sets of boxplots, there did not appear to be any differences between the quartiles, as they all overlapped roughly the same range of their respective independent variables. The histograms did not appear normal: the data were spread out irregularly, with large gaps between bins. The Q-Q plots told a similar story, as the points tended to follow a sine-like trend around the line, with heavy tails on either end. None of the freedom scores appeared to be normally distributed.

## Individual EDA of Work Variations
## Observations per group: 17922, 17175, 17218, 17256. 101 missing.
## Factor w/ 4 levels "[0,5.3]","(5.3,7.9]",..: 2 4 2 3 1 3 3 3 3 2 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.300 7.900 9.251 11.600 100.000 101


## Observations per group: 17557, 17275, 17411, 17324. 105 missing.
## Factor w/ 4 levels "[0,23.7]","(23.7,31.7]",..: 3 1 2 2 4 2 1 4 2 2 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 23.70 31.70 33.23 41.80 100.00 105


## Observations per group: 17476, 17574, 17443, 17074. 105 missing.
## Factor w/ 4 levels "[0,20.3]","(20.3,23.9]",..: 2 2 2 3 1 4 3 4 3 1 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 20.30 23.90 24.12 27.70 100.00 105


## Observations per group: 17771, 17028, 17599, 17169. 105 missing.
## Factor w/ 4 levels "[0,14.1]","(14.1,18.3]",..: 2 4 4 3 2 2 4 1 2 1 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 14.10 18.30 19.65 24.00 100.00 105


## Observations per group: 17544, 17264, 17575, 17184. 105 missing.
## Factor w/ 4 levels "[0,5.4]","(5.4,8.7]",..: 3 3 3 2 1 2 3 2 2 3 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.400 8.700 9.636 12.800 100.000 105


## Observations per group: 17456, 17586, 17245, 17280. 105 missing.
## Factor w/ 4 levels "[0,7.7]","(7.7,12.3]",..: 3 4 3 3 3 3 3 1 3 4 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.70 12.30 13.36 17.80 100.00 105


## Observations per group: 17940, 17233, 17140, 17254. 105 missing.
## Factor w/ 4 levels "[0,3.5]","(3.5,5.4]",..: 2 3 4 1 2 3 1 3 2 3 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.500 5.400 6.109 7.900 100.000 105


Next, the seven work-variation variables (professional, production, unemployment, office, service, construction, and self-employed) were assessed for normality. The boxplots for unemployment, service, construction, and production showed income decreasing as the share of that work variation in the census tract increased. That is to say, as more unemployed individuals were accounted for in a given census tract, the income per capita decreased. The only work variation that exhibited an increase in average income was professional work. The remaining variables, office and self-employed, remained relatively stable across quartiles. In the histograms, only the proportion of professionals appeared to be normally distributed; the remaining six work variations were all skewed to the right. For professionals, the Q-Q plot supported normality: the points did not stray far from the line, and the right and left tails were very small. The same cannot be said for the other variables, as each had an oversized right tail and a relatively small left tail. Overall, the proportion of professionals appeared normally distributed while the other work variations did not.
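
The three diagnostics described above (boxplots by quartile, histogram, Q-Q plot) can be drawn with base R graphics. The sketch below uses synthetic data: `pct` is a hypothetical right-skewed percentage and `income` a hypothetical outcome, and it assumes the boxplots show income per capita grouped by quartile of the predictor.

```r
set.seed(2)
# Synthetic stand-ins: a right-skewed percentage and an income-like outcome.
pct <- pmin(rgamma(500, shape = 2, rate = 0.2), 100)
income <- 30000 - 400 * pct + rnorm(500, sd = 5000)

# Quartile groups of the predictor.
grp <- cut(pct, breaks = quantile(pct, probs = seq(0, 1, by = 0.25)),
           include.lowest = TRUE)

op <- par(mfrow = c(1, 3))  # three panels side by side
boxplot(income ~ grp, xlab = "Quartile of predictor",
        ylab = "Income per capita")
hist(pct, breaks = 30, main = "Histogram", xlab = "Percent of tract")
qqnorm(pct)                 # sample quantiles against normal quantiles
qqline(pct)                 # reference line through the quartiles
par(op)                     # restore the previous layout
```

For a right-skewed variable, the Q-Q points bend away from the reference line in the upper tail, which is the "oversized right tail" pattern noted in the text.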
## Individual EDA of Ethnicities
## Observations per group: 18357, 16720, 17177, 17418. 0 missing.
## Factor w/ 4 levels "[0,0.8]","(0.8,4]",..: 3 4 4 2 4 3 4 3 3 3 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.80 4.00 13.78 15.32 100.00


## Observations per group: 17726, 17191, 17343, 17412. 0 missing.
## Factor w/ 4 levels "[0,2.4]","(2.4,7.2]",..: 1 1 1 3 1 3 2 1 1 1 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.40 7.20 17.36 21.50 100.00


## Observations per group: 17518, 17416, 17489, 17249. 0 missing.
## Factor w/ 4 levels "[0,0.1]","(0.1,1.2]",..: 2 3 3 1 3 1 1 1 1 2 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.100 1.200 4.347 4.400 91.300


## Observations per group: 17437, 17434, 17487, 17314. 0 missing.
## Factor w/ 4 levels "[0,37.1]","(37.1,70.3]",..: 3 2 3 3 2 3 3 3 4 3 ...

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 37.10 70.30 61.24 88.40 100.00


## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7567 0.4000 100.0000


Finally, the five ethnic variables (Native, White, Black, Hispanic, and Asian) were investigated. The boxplots for White showed an increase in average income across the first, second, and third quartiles but no change in the fourth. The boxplots for Asian showed an increase from the first through the fourth quartile. The boxplots for Hispanic increased slightly between the first and second quartiles, did not change for the third, and decreased markedly in the fourth. The boxplots for Black showed an increase in average income between the first and second quartiles, followed by a decrease from the second to the fourth. Overall, it appeared that average income did change with the concentration of ethnicities in a census tract. The histogram for White was bimodal, with the highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histograms, White had the most responses, followed by Hispanic, Black, Asian, and Native. For each of the ethnicity variables, the points on the Q-Q plot followed a curve around the line, with large left and right tails. Also, there were not enough responses from the Native ethnicity to construct a meaningful boxplot. For the Native Q-Q plot, there was a clear pattern of the points along the line, implying non-normality. Therefore, based on the assessment of the boxplots, histograms, and Q-Q plots, none of the ethnicities appeared normally distributed.
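
The quartile-to-quartile comparisons read off the boxplots can also be checked numerically. The following is a minimal sketch on synthetic data; the data frame `tracts` and its columns `share` and `income` are hypothetical names, not taken from the original analysis.

```r
set.seed(3)
# Hypothetical tract-level data; column names are illustrative only.
tracts <- data.frame(share = runif(400, min = 0, max = 100))
tracts$income <- 20000 + 150 * tracts$share + rnorm(400, sd = 4000)

# Quartile groups of the ethnicity share, labelled Q1..Q4.
tracts$share_q <- cut(tracts$share,
                      breaks = quantile(tracts$share, probs = seq(0, 1, by = 0.25)),
                      include.lowest = TRUE,
                      labels = paste0("Q", 1:4))

# Median income per quartile group, and the change between adjacent groups.
med_by_q <- tapply(tracts$income, tracts$share_q, median)
med_by_q
diff(med_by_q)
```

Note that `cut()` stops with an error when the quantile breaks are not unique. That would happen for a variable whose minimum, first quartile, and median are all 0, as in the last summary above, so such a variable cannot be split into four quartile groups this way.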
# Preprocessing KNN
## Observations per group: 34838, 34834. 0 missing.
## Factor w/ 2 levels "[128,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...

## 'data.frame': 69672 obs. of 12 variables:
## $ TotalPop : int 1948 2156 2968 4423 10763 3851 2761 3187 10915 5668 ...
## $ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
## $ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
## $ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
## $ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
## $ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
## $ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
## $ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
## $ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
## $ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
## $ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
## $ ipc : Factor w/ 2 levels "[128,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
## [1] 626
## [1] 0
## [1] 12
## 'data.frame': 69567 obs. of 12 variables:
## $ TotalPop : int 1948 2156 2968 4423 10763 3851 2761 3187 10915 5668 ...
## $ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
## $ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
## $ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
## $ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
## $ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
## $ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
## $ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
## $ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
## $ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
## $ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
## $ ipc : Factor w/ 2 levels "[128,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
## - attr(*, "na.action")= 'omit' Named int 1484 1807 2299 2499 2789 4259 4444 4448 4449 4477 ...
## ..- attr(*, "names")= chr "1514" "1851" "2370" "2574" ...
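
The preprocessing output above appears to come from three steps: a two-level split of income per capita, a count of missing values, and removal of incomplete rows. The sketch below reproduces those steps on a small synthetic data frame. `census` and `IncomePerCap` are hypothetical names, and the median split is an assumption based on the nearly equal group sizes. The printed 626 is consistent with a total count of missing cells (101 + 5 × 105 from the work-variable summaries), although the original call is not shown.

```r
set.seed(4)
# Hypothetical miniature of the tract data; some names mirror the str() output.
census <- data.frame(
  TotalPop     = sample(500:10000, 20),
  Professional = c(runif(18, 10, 50), NA, NA),
  Unemployment = c(NA, runif(19, 2, 15)),
  IncomePerCap = round(runif(20, 8000, 60000))
)

# Two-level outcome: split income per capita at its median (assumed).
census$ipc <- cut(census$IncomePerCap,
                  breaks = quantile(census$IncomePerCap, probs = c(0, 0.5, 1)),
                  include.lowest = TRUE)
census$IncomePerCap <- NULL  # keep only the binned outcome

sum(is.na(census))           # total number of missing cells
clean <- na.omit(census)     # drop every row with at least one NA
sum(is.na(clean))            # 0 after removal
ncol(clean)                  # number of columns retained
str(clean)                   # shows the "na.action" attribute with dropped rows
```

`na.omit()` records the dropped row indices in the `na.action` attribute, which is why that attribute appears at the end of the second `str()` listing.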